{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Multivariate Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's grab a small little data set of Blue Book car values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "      <th>Mileage</th>\n",
       "      <th>Make</th>\n",
       "      <th>Model</th>\n",
       "      <th>Trim</th>\n",
       "      <th>Type</th>\n",
       "      <th>Cylinder</th>\n",
       "      <th>Liter</th>\n",
       "      <th>Doors</th>\n",
       "      <th>Cruise</th>\n",
       "      <th>Sound</th>\n",
       "      <th>Leather</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>17314.103129</td>\n",
       "      <td>8221</td>\n",
       "      <td>Buick</td>\n",
       "      <td>Century</td>\n",
       "      <td>Sedan 4D</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>17542.036083</td>\n",
       "      <td>9135</td>\n",
       "      <td>Buick</td>\n",
       "      <td>Century</td>\n",
       "      <td>Sedan 4D</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>16218.847862</td>\n",
       "      <td>13196</td>\n",
       "      <td>Buick</td>\n",
       "      <td>Century</td>\n",
       "      <td>Sedan 4D</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16336.913140</td>\n",
       "      <td>16342</td>\n",
       "      <td>Buick</td>\n",
       "      <td>Century</td>\n",
       "      <td>Sedan 4D</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16339.170324</td>\n",
       "      <td>19832</td>\n",
       "      <td>Buick</td>\n",
       "      <td>Century</td>\n",
       "      <td>Sedan 4D</td>\n",
       "      <td>Sedan</td>\n",
       "      <td>6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  \\\n",
       "0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1   \n",
       "1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1   \n",
       "2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1   \n",
       "3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1   \n",
       "4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1   \n",
       "\n",
       "   Doors  Cruise  Sound  Leather  \n",
       "0      4       1      1        1  \n",
       "1      4       1      1        0  \n",
       "2      4       1      1        0  \n",
       "3      4       1      0        0  \n",
       "4      4       1      0        1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.\n",
    "\n",
    "Note how we are avoiding the make and model; regressions don't work well with ordinal values, unless you can convert them into some numerical order that makes sense somehow.\n",
    "\n",
    "Let's scale our feature data into the same range so we can easily compare the coefficients we end up with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      Mileage  Cylinder     Doors\n",
      "0   -1.417485  0.527410  0.556279\n",
      "1   -1.305902  0.527410  0.556279\n",
      "2   -0.810128  0.527410  0.556279\n",
      "3   -0.426058  0.527410  0.556279\n",
      "4    0.000008  0.527410  0.556279\n",
      "5    0.293493  0.527410  0.556279\n",
      "6    0.335001  0.527410  0.556279\n",
      "7    0.382369  0.527410  0.556279\n",
      "8    0.511409  0.527410  0.556279\n",
      "9    0.914768  0.527410  0.556279\n",
      "10  -1.171368  0.527410  0.556279\n",
      "11  -0.581834  0.527410  0.556279\n",
      "12  -0.390532  0.527410  0.556279\n",
      "13  -0.003899  0.527410  0.556279\n",
      "14   0.430591  0.527410  0.556279\n",
      "15   0.480156  0.527410  0.556279\n",
      "16   0.509822  0.527410  0.556279\n",
      "17   0.757160  0.527410  0.556279\n",
      "18   1.594886  0.527410  0.556279\n",
      "19   1.810849  0.527410  0.556279\n",
      "20  -1.326046  0.527410  0.556279\n",
      "21  -1.129860  0.527410  0.556279\n",
      "22  -0.667658  0.527410  0.556279\n",
      "23  -0.405792  0.527410  0.556279\n",
      "24  -0.112796  0.527410  0.556279\n",
      "25  -0.044552  0.527410  0.556279\n",
      "26   0.190700  0.527410  0.556279\n",
      "27   0.337442  0.527410  0.556279\n",
      "28   0.566102  0.527410  0.556279\n",
      "29   0.660837  0.527410  0.556279\n",
      "..        ...       ...       ...\n",
      "774 -0.161262 -0.914896  0.556279\n",
      "775 -0.089234 -0.914896  0.556279\n",
      "776 -0.040523 -0.914896  0.556279\n",
      "777  0.002572 -0.914896  0.556279\n",
      "778  0.236603 -0.914896  0.556279\n",
      "779  0.249666 -0.914896  0.556279\n",
      "780  0.357220 -0.914896  0.556279\n",
      "781  0.365521 -0.914896  0.556279\n",
      "782  0.434131 -0.914896  0.556279\n",
      "783  0.517269 -0.914896  0.556279\n",
      "784  0.589908 -0.914896  0.556279\n",
      "785  0.599186 -0.914896  0.556279\n",
      "786  0.793052 -0.914896  0.556279\n",
      "787  1.033554 -0.914896  0.556279\n",
      "788  1.045762 -0.914896  0.556279\n",
      "789  1.205567 -0.914896  0.556279\n",
      "790  1.541414 -0.914896  0.556279\n",
      "791  1.561070 -0.914896  0.556279\n",
      "792  1.725026 -0.914896  0.556279\n",
      "793  1.851502 -0.914896  0.556279\n",
      "794 -1.709871  0.527410  0.556279\n",
      "795 -1.474375  0.527410  0.556279\n",
      "796 -1.187849  0.527410  0.556279\n",
      "797 -1.079929  0.527410  0.556279\n",
      "798 -0.682430  0.527410  0.556279\n",
      "799 -0.439853  0.527410  0.556279\n",
      "800 -0.089966  0.527410  0.556279\n",
      "801  0.079605  0.527410  0.556279\n",
      "802  0.750446  0.527410  0.556279\n",
      "803  1.932565  0.527410  0.556279\n",
      "\n",
      "[804 rows x 3 columns]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\sklearn\\utils\\validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n",
      "  warnings.warn(msg, _DataConversionWarning)\n",
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\ipykernel\\__main__.py:8: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\pandas\\core\\indexing.py:477: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "  self.obj[item] = s\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>Price</td>      <th>  R-squared:         </th> <td>   0.064</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.060</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   18.11</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Mon, 15 May 2017</td> <th>  Prob (F-statistic):</th> <td>2.23e-11</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>10:31:39</td>     <th>  Log-Likelihood:    </th> <td> -9207.1</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>   804</td>      <th>  AIC:               </th> <td>1.842e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>   801</td>      <th>  BIC:               </th> <td>1.843e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     3</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "      <td></td>        <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Mileage</th>  <td>-1272.3412</td> <td>  804.623</td> <td>   -1.581</td> <td> 0.114</td> <td>-2851.759</td> <td>  307.077</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Cylinder</th> <td> 5587.4472</td> <td>  804.509</td> <td>    6.945</td> <td> 0.000</td> <td> 4008.252</td> <td> 7166.642</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Doors</th>    <td>-1404.5513</td> <td>  804.275</td> <td>   -1.746</td> <td> 0.081</td> <td>-2983.288</td> <td>  174.185</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>157.913</td> <th>  Durbin-Watson:     </th> <td>   0.008</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td> 257.529</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 1.278</td>  <th>  Prob(JB):          </th> <td>1.20e-56</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 4.074</td>  <th>  Cond. No.          </th> <td>    1.03</td>\n",
       "</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                  Price   R-squared:                       0.064\n",
       "Model:                            OLS   Adj. R-squared:                  0.060\n",
       "Method:                 Least Squares   F-statistic:                     18.11\n",
       "Date:                Mon, 15 May 2017   Prob (F-statistic):           2.23e-11\n",
       "Time:                        10:31:39   Log-Likelihood:                -9207.1\n",
       "No. Observations:                 804   AIC:                         1.842e+04\n",
       "Df Residuals:                     801   BIC:                         1.843e+04\n",
       "Df Model:                           3                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "Mileage    -1272.3412    804.623     -1.581      0.114   -2851.759     307.077\n",
       "Cylinder    5587.4472    804.509      6.945      0.000    4008.252    7166.642\n",
       "Doors      -1404.5513    804.275     -1.746      0.081   -2983.288     174.185\n",
       "==============================================================================\n",
       "Omnibus:                      157.913   Durbin-Watson:                   0.008\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529\n",
       "Skew:                           1.278   Prob(JB):                     1.20e-56\n",
       "Kurtosis:                       4.074   Cond. No.                         1.03\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import statsmodels.api as sm\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scale = StandardScaler()\n",
    "\n",
    "X = df[['Mileage', 'Cylinder', 'Doors']]\n",
    "y = df['Price']\n",
    "\n",
    "X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].as_matrix())\n",
    "\n",
    "print (X)\n",
    "\n",
    "est = sm.OLS(y, X).fit()\n",
    "\n",
    "est.summary()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The table of coefficients above gives us the values to plug into an equation of form:\n",
    "    B0 + B1 * Mileage + B2 * model_ord + B3 * doors\n",
    "    \n",
    "In this example, it's pretty clear that the number of cylinders is more important than anything based on the coefficients.\n",
    "\n",
    "Could we have figured that out earlier?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Doors\n",
       "2    23807.135520\n",
       "4    20580.670749\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.groupby(df.Doors).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Surprisingly, more doors does not mean a higher price! (Maybe it implies a sport car in some cases?) So it's not surprising that it's pretty useless as a predictor here. This is a very small data set however, so we can't really read much meaning into it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Mess around with the fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
